print("team member: ")
print("long zhang" + " " +"47778605")
print("zhengwen zhang" + " "+ "47502277")
print("qinyu chen"+ " " + "47813055")
It is well known that heart disease is the leading cause of death in the US and other developed countries, so it is interesting to learn which risk factors contribute to it. Since 1948, the Framingham Heart Study has been investigating heart disease, and over roughly 70 years it has continually refreshed our understanding of heart health and disease. The study has continued for three generations and produced 3698 articles. Framingham's sign proudly carries two phrases: "A small town that changed the heart of Americans" and "Framingham Heart Research House".
The "Logistic regression - To predict heart disease" dataset is publicly available on Kaggle; it comes from the ongoing study carried out in the town of Framingham, MA. The dataset contains over 1000 rows (4238) and 16 columns, can be downloaded for free, and is ready to use immediately. It contains categorical features (e.g., male/female, current smoker or not), and the goal of classification is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD, the last column), so this dataset meets all the requirements for Lab One.
Doctors and healthcare providers would be interested in the results because they can gain a better understanding of the risk factors associated with heart disease and make recommendations to existing patients on how to reduce risks or complications. Based on our model, they can also make predictions for new patients from collected survey data, and patients can adjust their lifestyle according to the prediction results.
Once we begin modeling, we will visualize the relationships between different features. We will also apply PCA for dimensionality reduction, which greatly reduces computational complexity, reduces recognition errors caused by redundant information, and improves accuracy. These analyses will be useful to third parties and researchers.
Dataset: heart disease prediction URL: https://www.kaggle.com/dileep070/heart-disease-prediction-using-logistic-regression
Question of interest: which patients will develop heart disease within 10 years?
#import the libraries
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy
#load the dataset from a local directory; the "r" prefix (raw string) avoids unicode-escape errors in Windows paths
#chd = pd.read_csv(r"C:\Users\tony\Dropbox\smu\CSE7324\assignment\LabOne\framingham.csv")
chd = pd.read_csv("/Users/chenqinyu/Desktop/framingham.csv")
#chd = pd.read_csv("/Users/terrence/Desktop/framingham.csv")
#preview of the first 5 observations
chd.head()
#chd1= chd.copy()
#get the data summary
chd.describe()
print(chd.info())
#data types
As there are 4238 rows in the CSV file, this data frame shows that some columns have fewer than 4238 counts, so these features have missing values. All features are either int64 or float64, so the data are entirely numerical and there is no need to convert categorical values to numbers.
# creating a data description table
chd_des = pd.DataFrame()
chd_des['features']=chd.columns
chd_des['descriptions']=['sex','age','education','current smoker or not','number of cigarettes per day','blood pressure medications','had a stroke before','had hypertension before','diabetes','total cholesterol level','systolic blood pressure','diastolic blood pressure','Body Mass Index','heart rate','glucose level','10 year risk of coronary heart disease CHD']
chd_des['scales']=['nominal','continuous','nominal','nominal','continuous']+['nominal']*4+['continuous']*6+['binary']
chd_des['range']=['1:male;0:female']+['0-100']+['1-4']+['0:No;1:Yes']+['0-70']+['0:No;1:Yes']+['0:No;1:Yes']+['0:No;1:Yes']+['0:No;1:Yes']+['0-696']+['0-295']+['0-142.5']+['0-56.8']+['0-143']+['0-394']+['0:No;1:Yes']
chd_des
Here we have the information for all 16 attributes, and we can clearly see their descriptions, scales, and ranges. From common sense, all of these features are important factors in heart research, and some features are directly related: for example, if a person is not a current smoker, his or her cigsPerDay value is necessarily 0.
import missingno as mn
mn.matrix(chd)
From the visualization above, we find that six features (education, cigsPerDay, BPMeds, totChol, BMI, glucose) have missing values. Some plausible reasons: some people are self-taught, so their education level is hard to classify, and values such as BMI may be inconvenient to measure; similar explanations apply to the other features with missing values.
We know from dataset description that there are some missing values. First of all, let's check how many features are missing for all observations.
print (chd.isnull().sum())
#checking missing values amount of each features in this dataset
The results above show that 7 of the 16 features have missing values; for example, 105 of the 4238 observations are missing 'education'. Next, we try to impute the missing data.
chd["education"].describe()
chd.education.value_counts()
#according to the statistics of education and the counts of its distinct values, we can impute the 105
#missing education values with the median, "2.0"
Next, we do the same for the next feature with missing values.
chd["cigsPerDay"].describe()
chd.cigsPerDay.value_counts()
#according to the statistics of cigsPerDay and the counts of its distinct values, we can impute
#the 29 missing cigsPerDay values with the mode, "0"
chd.BPMeds.value_counts()
#the counts of distinct BPMeds values show that most people are "0",
#i.e. most people take no blood pressure medication, so we impute the 53 missing BPMeds values with the mode, "0"
chd["totChol"].describe()
#the statistics of total cholesterol show that most people are within the range 200-300,
#so we impute the 50 missing totChol values with the mean, "236.721585"
chd["BMI"].describe()
#the statistics of BMI show that most people are within the range 20-30,
#so we impute the 19 missing BMI values with the mean, "25.802008"
chd["heartRate"].describe()
#the statistics of heartRate show that most people are within the range 70-80,
#so we impute the 1 missing heartRate value with the mean, "75.878924"
chd["glucose"].describe()
#the statistics of glucose show that most people are within the range 70-80,
#so we impute the 388 missing glucose values with the mean, "81.966753"
Replace the missing values with the values mentioned above and convert each column to its original data type (float).
import copy
chd1= copy.deepcopy(chd)
#imputing null education with the median (50%) value 2.0
chd["education"].fillna(2.0, inplace = True)
chd['education'] = chd['education'].astype(float)
chd.education.plot(kind='hist',alpha=0.5)
chd1.education.plot(kind='hist',alpha=0.5)
plt.show()
#imputing null cigsPerday with mode '0'
chd["cigsPerDay"].fillna(0, inplace = True)
chd['cigsPerDay'] = chd['cigsPerDay'].astype(float)
#imputing null BPMeds with mode '0'
chd["BPMeds"].fillna(0, inplace = True)
chd['BPMeds'] = chd['BPMeds'].astype(float)
#imputing null total cholesterol with the mean value of 236.721585
chd["totChol"].fillna(236.721585, inplace = True)
chd['totChol'] = chd['totChol'].astype(float)
#imputing null BMI value to be the mean of 25.802008
chd["BMI"].fillna(25.802008, inplace = True)
chd['BMI'] = chd['BMI'].astype(float)
#imputing null heart rate value to be the mean of 75.878924
chd["heartRate"].fillna(75.878924, inplace = True)
chd['heartRate'] = chd['heartRate'].astype(float)
#imputing null glucose value to be the mean of 81.966753
chd["glucose"].fillna(81.966753, inplace = True)
chd['glucose'] = chd['glucose'].astype(float)
#preview the table after imputation
chd
#this confirms that the imputation was successful for all features
print(chd.info())
#this is another confirmation that the imputation was successful for all features
print (chd.isnull().sum())
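As a maintainability note: the fill constants above were read off `describe()` by hand and would silently go stale if the data changed. A minimal sketch of computing the same mode/mean fills programmatically (the `impute_missing` helper and the demo frame are our own illustration, not part of the lab code):

```python
import pandas as pd

def impute_missing(df, mode_cols, mean_cols):
    """Fill NaNs with the column mode (categorical-like) or mean (continuous)."""
    out = df.copy()
    for col in mode_cols:
        out[col] = out[col].fillna(out[col].mode()[0])
    for col in mean_cols:
        out[col] = out[col].fillna(out[col].mean())
    return out

# tiny demo frame whose column names mirror the dataset
demo = pd.DataFrame({
    "education": [1.0, 2.0, 2.0, None],        # mode is 2.0
    "totChol":   [200.0, None, 240.0, 250.0],  # mean of the rest is 230.0
})
filled = impute_missing(demo, mode_cols=["education"], mean_cols=["totChol"])
```

The same call on `chd` would reproduce the fills above without hard-coding the numbers.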
chd1 = chd.copy()
chd1.head()
#check if there is any duplicates
idx=chd.duplicated()
len(chd[idx])
No duplicates have been found.
chd2 = chd.iloc[:,0:15]
#create a dataframe which only has features, no target in it
chd2.head()
# Let's aggregate by age and sum the CHD counts
chd_grouped_age = chd.groupby('age').TenYearCHD.sum()
chd_grouped_age.plot(kind='bar')
plt.xlabel('Age')
plt.ylabel('counts of chd')
plt.title('relationship between age and chd')
plt.show()
We plot the CHD counts versus age; clearly, increasing age shows increasing odds of having CHD.
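One caveat: raw CHD counts per age partly reflect how many participants fall in each age group. Taking the mean of the 0/1 target per group gives a rate that separates risk from sample size. A small sketch with a toy frame (our own stand-in; in the notebook one would group the real `chd`):

```python
import pandas as pd

# toy stand-in for chd with just the two relevant columns
toy = pd.DataFrame({
    "age":        [40, 40, 40, 60, 60],
    "TenYearCHD": [0,  0,  1,  1,  1],
})
# mean of a 0/1 column is the fraction of positive cases per group
chd_rate = toy.groupby("age").TenYearCHD.mean()
# chd_rate.plot(kind='bar') would then show risk per age rather than raw counts
```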
# Let's aggregate by cigsPerDay and sum the CHD counts
chd_grouped_cigs = chd.groupby('cigsPerDay').TenYearCHD.sum()
chd_grouped_cigs.plot(kind='bar')
plt.xlabel('cigsPerDay')
plt.ylabel('counts of chd')
plt.title('relationship between cigsPerDay and chd')
plt.show()
tenyears_chd = pd.crosstab(chd['TenYearCHD'], chd['male'])
print(tenyears_chd)
tenyears_chd.plot(kind='bar', stacked=True)
plt.xlabel("ten years coronary heart disease")
plt.ylabel("Number of Counts")
plt.title('Distribution of male')
plt.show()
As the bar plot above shows, among those who will not develop heart disease within ten years, the proportion of females is higher than that of males; among those who will, the proportion of males is slightly higher than that of females.
sns.violinplot("male", "sysBP", data=chd,
palette=["pink", "gray"]);
From the plot above, we can tell that males' systolic blood pressure is slightly higher than females', and the shape of the male violin is more compressed and flat, meaning male systolic blood pressure is more concentrated around the median (around 125). The range between the upper and lower adjacent values for males is also slightly smaller than for females.
sns.distplot(chd['BMI'])
plt.xlabel(' BMI')
plt.ylabel('Frequency')
plt.title('Frequency of BMI')
As the plot shows, the frequency distribution of BMI looks very much like a Gaussian distribution. A BMI of 25 is the most frequent value, with 27 and 26 the second and third most frequent.
colors = ['blue', 'purple', 'green', 'yellow']
chd['education'].value_counts().plot.bar(color = colors)
plt.xlabel('education')
plt.ylabel('number of counts')
plt.title('level of education count of sample')
According to the bar plot of education level counts, most people in this sample are at education level "1.0", the lowest level, and the dataset follows a pattern in which the higher the education level, the smaller the corresponding count.
with sns.axes_style(style='ticks'):
    g = sns.factorplot("education", "heartRate", "male", data=chd, kind="box")
    g.set_axis_labels("education", "heartRate");
According to the plot above, the interesting observation is that the median heart rate of females is always equal to or slightly higher than that of males at every education level. The male median heart rate is about the same at education levels 1.0, 2.0, and 3.0, and is noticeably lower at level 4.0. The upper quartiles of the female box plots at education levels 1.0 and 2.0 are also higher than those at levels 3.0 and 4.0.
chd['age_range'] = pd.cut(chd.age, [30,40,50,60,70], labels=['30yo-40yo','40yo-50yo','50yo-60yo','60yo-70yo'])
chd.head()
Hyp_chd = pd.crosstab(chd['prevalentHyp'], chd['age_range'])
print(Hyp_chd)
Hyp_chd.plot(kind='bar', stacked=True)
plt.xlabel("prevalent hypertensive")
plt.ylabel("Number of Counts")
plt.title('Distribution of age_range')
plt.show()
From the bar plot above, among those with prevalent hypertension the largest age range is 50yo-60yo, followed by 40yo-50yo and 60yo-70yo, with very few in 30yo-40yo. Among those without prevalent hypertension, the largest age range is 40yo-50yo, followed by 50yo-60yo and 30yo-40yo, with few in 60yo-70yo.
chd['heartRate_range'] = pd.cut(chd.heartRate, [40,60,100,150], labels=['low','normal','high'])
chd.head()
sns.violinplot(x="heartRate_range", y="TenYearCHD", hue="diabetes", data=chd,
split=True, inner="quart")
To answer this question we can use a violin plot. From the plot above we can conclude: 1) from the heart-rate perspective, the lower the heart rate, the less likely a person is to have CHD within ten years, and the less likely they are to have diabetes; 2) from the diabetes perspective, people with diabetes tend to have normal or high heart rates, and their chance of developing CHD is also higher; 3) from the CHD perspective, people with CHD are more likely to have diabetes and a high heart rate.
# set the plotting style
cmap = sns.set(style="darkgrid")
# set the figure size
f, ax = plt.subplots(figsize=(10, 10))
# exclude the NaN/null or object datatypes and plot the correlation map
sns.heatmap(chd2.corr(), cmap=cmap, annot=True)
f.tight_layout()
plt.title('Correlation Heatmap')
To explore the correlations between features we can use a correlation heatmap. From the heatmap and its legend, the lighter the cell, the more positively correlated two features are; the darker, the more negatively correlated.
1> hypertension: systolic and diastolic blood pressure are strongly positively correlated, and both are positively correlated with prevalent hypertension.
2> diabetes: glucose is strongly positively correlated with diabetes.
3> age: age is negatively correlated with education level and cigarettes per day, which means that in this sample, younger people tend to have more education and to smoke more cigarettes per day.
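Reading the strongest pairs off the heatmap by eye is error-prone; they can also be extracted programmatically. A sketch (the `top_correlations` helper and the synthetic frame are our own illustration, not lab code):

```python
import numpy as np
import pandas as pd

def top_correlations(df, k=3):
    """Return the k feature pairs with the largest absolute correlation."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair appears once, without the diagonal
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    return corr.where(mask).stack().sort_values(ascending=False).head(k)

# synthetic demo: b is a noisy copy of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
top = top_correlations(demo, k=1)  # the (a, b) pair should dominate
```

Calling `top_correlations(chd2)` would list the dominant pairs (e.g. sysBP/diaBP) directly.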
import numpy as np
from sklearn.datasets import load_iris, load_digits
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline
sns.set(style='white', context='notebook', rc={'figure.figsize':(10,7)})
chd.head()
#sns.pairplot(chd, hue='education')
PCA is a common data analysis method, frequently used for dimensionality reduction of high-dimensional data; it extracts the main feature components of the data.
#import the dataset
df = chd1
#our target is TenYearCHD
df['target'] = df.TenYearCHD.astype(int)
# Delete the column of target from our table
df = df.drop("TenYearCHD",axis=1)
# because different features have different units, we have to standardize the data
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X = df.values
X = StandardScaler().fit_transform(X)
y = df.target
pca = PCA(n_components=2)
pca.fit(X) # fit PCA to the standardized data (transform on the next line)
X_pca = pca.transform(X)
# print the components
print ('pca:', pca.components_)
Here we obtain the principal component loadings that PCA uses to project the data. (Note that df still contains the target column at this point, so the target itself participates in the PCA; dropping it from X first would be cleaner.)
import seaborn as sns
cmap = sns.set(style="darkgrid")
# the code is mentioned in class
def get_feature_names_from_weights(weights, names):
    tmp_array = []
    for comp in weights:
        tmp_string = ''
        for fidx, f in enumerate(names):
            if fidx > 0 and comp[fidx] >= 0:
                tmp_string += '+'
            # use the full feature name (the class version stripped a units suffix
            # with f[:-5], which would empty out short names like "male")
            tmp_string += '%.2f*%s ' % (comp[fidx], f)
        tmp_array.append(tmp_string)
    return tmp_array
plt.style.use('default')
# Data Analytics
pca_weight_strings = get_feature_names_from_weights(pca.components_, df.columns)
# create some pandas dataframes from the transformed outputs
df_pca = pd.DataFrame(X_pca,columns=[pca_weight_strings])
from matplotlib.pyplot import scatter
# scatter plot the output, with the names created from the weights
ax = scatter(X_pca[:,0], X_pca[:,1], c=y, s=(y+2)*10, cmap=cmap)
plt.xlabel(pca_weight_strings[0])
plt.ylabel(pca_weight_strings[1])
#plt.figure(figsize=(10,20))
By using PCA, the data dimension is reduced from 16 to 2. From the plot we can see the dataset divides into two clusters, although not very clearly. We can conclude that TenYearCHD (our target) is related to many features, some strongly and some weakly. We would like the projected values to be as spread out as possible, because overlapping samples effectively disappear; from an entropy perspective, the larger the entropy, the more information is retained. To pursue this goal, we continue our work.
# code also comes from class
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Scatter, Marker, Layout, layout, XAxis, YAxis, Bar, Line
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
                 ],
        "layout": Layout(xaxis=layout.XAxis(title='Principal components'),
                         yaxis=layout.YAxis(title='Explained variance ratio'))
        })
pca = PCA(n_components=4)
pca.fit(X)
plot_explained_variance(pca)
From the plot above, the largest principal component explains only about 0.2 of the variance, and the sum of the two largest components is less than 0.4, which is not very good. If we want a much clearer representation, we need another, more powerful method. After searching, we found that UMAP is a good option: it can effortlessly process large, high-dimensional datasets, and it combines visualization with dimensionality reduction. Besides retaining local structure, it also preserves the global structure of the data: UMAP maps nearby points on the manifold to nearby points in the low-dimensional representation, and does the same for distant points. Therefore, we use UMAP here.
import umap
from sklearn.datasets import load_digits
import warnings
warnings.filterwarnings('ignore')
# select the 15 feature columns (the target occupies the last column)
data_target = df.iloc[:,:15]
#standardization
data_mean = data_target.mean()
data_std = data_target.std()
data_fea = (data_target - data_mean)/data_std
# here we use the code which comes from citation
umap_data = umap.UMAP(n_neighbors=5, min_dist=0.8, n_components=3).fit_transform(data_fea.values)
umap_data.shape
It shows that there are 4238 samples but only three columns (n_components=3) instead of the 15 features we started with, so UMAP has reduced the dimensionality to 3.
plt.figure(figsize=(10,8))
plt.scatter(umap_data[:,0], umap_data[:,1])
plt.scatter(umap_data[:,1], umap_data[:,2])
plt.scatter(umap_data[:,2], umap_data[:,0])
From the plot above, the dataset is clustered into several groups. The dimensions have been reduced and we can examine the different transformed components: the correlation between them is very small. Compared with the components obtained from PCA, the UMAP components appear to separate the data better, so UMAP tends to provide better results here.
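The correlation claim can be checked numerically with `np.corrcoef` on the transformed components; one caveat is that PCA scores are uncorrelated by construction, so both methods should already give small off-diagonal values. A self-contained sketch with synthetic data (all names below are our own, not from the lab):

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic correlated features standing in for the standardized matrix X
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X_demo = np.hstack([base + 0.3 * rng.normal(size=(500, 1)) for _ in range(4)])

scores = PCA(n_components=2).fit_transform(X_demo)
# correlation matrix of the component scores; off-diagonal entries are ~0 for PCA
comp_corr = np.corrcoef(scores, rowvar=False)
```

The same call with `umap_data` in place of `scores` gives the UMAP side of the comparison.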